Abstract: We study the problem of compressing a source
sequence in the presence of side-information that is related to the
source via insertions, deletions and substitutions. We propose a
simple algorithm to compress the source sequence when the
side-information is present at both the encoder and decoder. A key
attribute of the algorithm is that it encodes the edits contained
in runs of different extents separately. For small insertion and
deletion probabilities, the compression rate of the algorithm is
shown to be asymptotically optimal.
Fig. 1. Structure of the system

I. Introduction

In [1], we have studied the problem of compressing a source
sequence with the help of mis-aligned decoder-only side-information,
where the source and side-information are the input and output of a
deletion channel, respectively. The minimum rate is shown to
correspond to the amount of information in the deleted content plus
the locations of the deletions, minus the uncertainty in the
locations given the source and side-information. We refer to the
latter as "nature's secret". This is the information that the encoder
and decoder can never find out. It represents the over-counting of
information in the locations of the deletions. For example, if the
input and output of a deletion channel are (0,0) and (0), the encoder
and decoder will never know and never need to know whether the
first or the second bit is deleted. An interesting question is:
how can one construct a practical compression algorithm with the
optimal compression rate, where the encoded bits do not reveal
"nature's secret"? In this paper we provide such a construction
for a simpler problem where the side-information is available
at both the encoder and decoder. Although the availability of
the side-information is changed, the minimum rate remains the
same.

In this paper, we study the problem of compressing a source
sequence, X, with the help of side-information, Y, which
is available at both the encoder and the decoder. The
side-information is related to the source via insertions, deletions
and substitutions. See Figure 1 for an illustration of the
system. The objective of this work is to construct an
encoding/decoding algorithm to achieve the optimal compression
rate, defined as the minimum number of encoded bits per
source bit.



This material is based upon work supported by the US National Science
Foundation (NSF) under grants 23287 and 30149 and by a gift from
Qualcomm Inc. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not necessarily reflect
Here is an example of the source and side-information:

X = (0,0,1,1,0,1)
Y = (0,1,0,0,1,1)



In order to compare these two sequences, we can insert
some gaps, which are denoted by '-', to align them as follows.

X* = (0,0,1,1,0,1,-)
Y* = (0,-,1,0,0,1,1)

This alignment explains X with respect to Y with an
insertion, a substitution and a deletion: $X_2$ is inserted between
$Y_1$ and $Y_2$; $X_4$ substitutes $Y_3$; $Y_6$ is deleted. The encoder needs
to describe the above editing information using the minimum
number of bits.

The problem of synchronizing edited sequences has been
studied by [2]-[4] assuming the number of edits is a constant
that does not increase with the length of the sequence. Upper
and lower bounds on the minimum number of encoded bits
were provided as functions of the number of edits and the
length of the sequence. In [5], an interactive, low-complexity
and asymptotically optimal scheme was proposed. In comparison,
in this paper, we consider the case that a fraction of source
bits, rather than a constant number of bits, is edited, which
makes the problem more general. There are also practical
synchronization algorithms, such as RSYNC [6] for generic
files and VSYNC [7], which targets video applications. In the
special case when the source and the side-information differ
only by substitutions (side-information is aligned), a universal
compression algorithm has been proposed by [8].

In this paper, we propose a simple compression algorithm, 
for which the compression rate is asymptotically optimal 
when the editing probability is small. The key ideas are: 
(1) describing the locations of insertions and deletions by 
specifying the run^1 of side-information in which they appear,
and (2) separately encoding the edits that appear in runs of
different extents. To explain idea (1), consider the example
where the side-information is Y = (0,0,1,0) and the source
is X = (0,1,0). Neither the encoder nor the decoder knows
whether the first bit or the second bit is deleted. Therefore the
encoder needs to describe the location of the deletion only up
to a run, which consists of the first two bits in this example, but
not further. To explain idea (2), consider the example where the
side-information is Y = (0,0,1,0) and the source is X = (0,1).
These sequences can be explained by two deletions, in the first
run and the third run of Y, respectively. If the deletion process
is memoryless and stationary, the longer first run is more likely
to contain a deletion than the shorter third run. Therefore
the two deletion events should be encoded separately, using
entropy coders with different target distributions, or using a
universal entropy coder.

Our compression algorithm can find applications in a number
of settings, for example, to compress genomic sequences,
as in [9].^2 The difference between the genomic sequences from
two individuals of the same species is a small fraction of a 
whole sequence, and is in the form of insertions, deletions and 
substitutions. If one of the genomic sequences can be used as 
side-information, the algorithm can be used to compress the 
other sequence. The algorithm can also be used in distributed 
file backup or file sharing systems, where different source 
nodes have different versions of the same file differing by 
a small number of edits including insertions, deletions and 
substitutions. Here, an old version can be used as side- 
information that is mis-aligned to the new version of the same 
file. 

The rest of this paper is organized as follows. In Section II,
we formally set up the problem. In Section III, we consider
a simple case where the source sequence is obtained from the
side-information by pure deletion. We present the algorithm
and analyze its performance. In Section IV, we present the
algorithm in the general setup.

Notation: Symbols in boldface represent sequences or matrices,
and the symbols in non-boldface represent scalars. The
binary entropy function is denoted by $h_2(\cdot)$. The notation
$\{0,1\}^n$ denotes the $n$-fold Cartesian product of $\{0,1\}$, and $\{0,1\}^*$
denotes $(\bigcup_{k \in \mathbb{Z}^+} \{0,1\}^k) \cup \{\emptyset\}$.

II. Problem Setup 

We will define two sequences X and Y, which differ by 
insertions, deletions, and substitutions. 

First, consider an auxiliary length-$n$ sequence $Z_X =
(Z_{X,1}, \ldots, Z_{X,n}) \in \{0,1\}^n$ ~ iid Bernoulli($p$), where $p \in (0,1)$.
Pass $Z_X$ through a binary symmetric channel with crossover
probability $q$ to get $Z_Y$.

We will then make deletions in $Z_X$ and $Z_Y$ to construct X
and Y, respectively. Let the deletion pattern $D_X$ be a length-$n$
sequence ~ iid Bernoulli($d_X$), which is independent of $Z_X$
and $Z_Y$. The deleted sequence $X \in \{0,1\}^*$ is a subsequence of
$Z_X$, which is derived from $Z_X$ by deleting all those $Z_{X,i}$'s with
$D_{X,i} = 1$. Similarly, the deletion pattern $D_Y$ ~ iid Bernoulli($d_Y$)
describes the deletion process from $Z_Y$ to Y.

^1 A run is a maximal-length sequence of a repeated symbol. The extent,
or length, of a run is the number of times the symbol repeats.

^2 We would like to thank Dr. Tsachy Weissman for introducing us to this
application.

Since the editing process from $Z_X$ to X is a deletion process,
the inverse process from X to $Z_X$ can be regarded as an
insertion process. Therefore from X to Y there are insertions
(from X to $Z_X$), substitutions (from $Z_X$ to $Z_Y$) and deletions
(from $Z_Y$ to Y).

Both sequences X and Y are available to the encoder, whereas
only Y is available to the decoder, as side-information. All the
other sequences, $Z_X$, $Z_Y$, $D_X$, and $D_Y$, are available to neither
the encoder nor the decoder. The encoder encodes X in the
presence of Y and sends a bit string of variable length to the
decoder so that the decoder can reproduce X without any error.
The sequences X and Y are called the source sequence and
the side-information, respectively. See Fig. 2 for the
structure of the system together with the source model.
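The source model above can be simulated directly. The following is a minimal sketch, assuming binary sequences stored as Python lists; the function name `simulate_source_model` and its parameter names are illustrative choices and do not come from the paper.

```python
import random

def simulate_source_model(n, p, q, d_x, d_y, seed=0):
    """Draw one realization of the source model (illustrative sketch)."""
    rng = random.Random(seed)
    # Z_X: n iid Bernoulli(p) bits.
    z_x = [int(rng.random() < p) for _ in range(n)]
    # Z_Y: Z_X passed through a binary symmetric channel with crossover q.
    z_y = [b ^ int(rng.random() < q) for b in z_x]
    # Deletion patterns D_X and D_Y: a 1 means "delete this position".
    pat_x = [int(rng.random() < d_x) for _ in range(n)]
    pat_y = [int(rng.random() < d_y) for _ in range(n)]
    # X and Y are the surviving bits of Z_X and Z_Y.
    x = [b for b, deleted in zip(z_x, pat_x) if not deleted]
    y = [b for b, deleted in zip(z_y, pat_y) if not deleted]
    return x, y, z_x, z_y, pat_x, pat_y
```

Setting `q = 0` and `d_y = 0` gives the pure deletion case, in which Y coincides with the auxiliary sequence and X is a subsequence of Y.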

Fig. 2. Structure of the system with the source model

The performance of the encoder and the decoder is measured
by the expected operational rate, which is defined as
$R_{op} := \lim_{n\to\infty} E[L_M / L_Y]$, where $L_M$ is the length of the encoded
bit string, and $L_Y$ is the length of Y. The objective of this
work is to find an encoder and a decoder which minimize the
expected operational rate.

III. Algorithm for the Pure Deletion Case 

In order to provide a clear presentation of our algorithm,
we start by considering a special case of the general problem,
where the source sequence X is derived from the
side-information Y only by deletion, but not substitution or
insertion. Formally speaking, $q = 0$ and $d_Y = 0$, which imply
$Z_X = Z_Y = Y$. For the sake of simplicity, in this section and
Appendix A, we drop the subscript X in $d_X$ and $D_X$ and denote
them as $d$ and $D$, respectively.

A. Algorithm for pure deletion 

The encoder has the following three stages. 
1) Alignment: In this stage we insert some gaps in X
to get X*, which has the same length as Y. The following
greedy alignment algorithm described in [10, Section 3.1]
is used.
Read X and Y from left to right. Take the first bit of
X, and match it with the leftmost appearance of this
bit in Y; then take the second bit of X, and match it
with the subsequent leftmost appearance of this bit in
Y; and so on. All the bits in Y that are not matched
with bits from X are matched with gaps denoted by '-'.
Let X* be the aligned version of X with gaps inserted.
The alignment implies a reconstructed deletion pattern
$\hat{D}$, which can explain the deletion process from Y to X,
but is in general different from $D$.

2) Describing the deletions with respect to runs: 

Let the maximum extent of the runs in Y be $L_{\max}$. For an
iid sequence Y, $E[L_{\max}] = O(\log n)$ [11].
The encoder performs the following:
• For $l = 1, \ldots, L_{\max}$, do:

- Compute $U_l$, the number of runs of extent $l$ in Y.

- For $i = 1, \ldots, U_l$, compute $V_{l,i}$, the number of
deletions in the $i$-th run of extent $l$ in Y according
to $\hat{D}$.

3) Entropy coding: For each $l = 1, \ldots, L_{\max}$, compress the
sequence $\{V_{l,i}\}_{i=1}^{U_l}$ using an entropy coder. Note that the $V_{l,i}$
with $l = 1, \ldots, L_{\max}$ have different distributions.

The encoded string generated by the encoder is the output 
of the entropy coder in stage 3). 

The decoder has the following two stages. 

1) Entropy decoder: Reconstruct $\{V_{l,i}\}_{i=1}^{U_l}$ for each $l$.

2) Locate deletions up to runs: For each $l$ and each $i$, find
the $i$-th run of extent $l$ in Y, and delete $V_{l,i}$ bits in that
run. The outcome is the reconstruction of X.

Since the total number of entries in $\{V_{l,i}\}$ is the total number
of runs in Y, which is no larger than $n$, the size of memory
the algorithm takes is $O(n)$. Since the greedy alignment and
the generation and coding of $\{V_{l,i}\}$ take $O(n)$ operations, the
algorithm takes $O(n)$ operations.

B. Example 

Let the side-information, the hidden deletion pattern, and 
the source sequence be as follows for example: 

Y = (1,0,1,1,0,0,0,1,0,1,1) 
D = (1,0,0,1,0,1,0,0,0,1,0) 
X = (0,1,0,0,1,0,1). 

On the encoder side: 

Stage 1): The greedy alignment algorithm aligns X and Y
and generates $\hat{D}$ as follows.

Y = (1,0,1,1,0,0,0,1,0,1,1)
X* = (-,0,1,-,0,0,-,1,0,1,-)

$\hat{D}$ = (1,0,0,1,0,0,1,0,0,0,1).

Stage 2): The maximum extent of the runs in Y is $L_{\max} = 3$.
There are $U_1 = 4$ runs of extent 1, $U_2 = 2$ runs of extent 2,
and $U_3 = 1$ run of extent 3. For the four extent-1 runs, '1',
'0', '1' and '0', only the first one is deleted according to $\hat{D}$,
therefore we have

$(V_{1,1}, V_{1,2}, V_{1,3}, V_{1,4}) = (1,0,0,0)$.

For the two extent-2 runs, '1,1' and '1,1', there is a deletion
in each of them. Therefore we have

$(V_{2,1}, V_{2,2}) = (1,1)$.

For the only extent-3 run, '0,0,0', there is a deletion in it.
Therefore we have

$(V_{3,1}) = (1)$.

Stage 3): The entropy encoder compresses
$((V_{1,1}, V_{1,2}, V_{1,3}, V_{1,4}), (V_{2,1}, V_{2,2}), (V_{3,1})) = ((1,0,0,0), (1,1), (1))$.
Note that each entry in $(V_{1,1}, V_{1,2}, V_{1,3}, V_{1,4})$ is more likely
to be 0 than the entries in $(V_{2,1}, V_{2,2})$ and $(V_{3,1})$.
Therefore we should use entropy encoders with different
target distributions to encode them when the sequences are
long.

On the decoder side: 

Stage 1): The entropy decoder reconstructs
$((V_{1,1}, V_{1,2}, V_{1,3}, V_{1,4}), (V_{2,1}, V_{2,2}), (V_{3,1})) = ((1,0,0,0), (1,1), (1))$.

Stage 2): Since $(V_{1,1}, V_{1,2}, V_{1,3}, V_{1,4}) = (1,0,0,0)$, the decoder
deletes the first run of extent 1, i.e., the first bit. Since
$(V_{2,1}, V_{2,2}) = (1,1)$, the decoder deletes a bit from each of the
two runs of extent 2. It does not matter which bit is deleted in
each run. Since $(V_{3,1}) = (1)$, the decoder deletes a bit in the
only extent-3 run. The deletions are represented by $\tilde{D}$ and the
reconstruction of the source sequence is denoted by $\hat{X}$.

Y = (1,0,1,1,0,0,0,1,0,1,1)
$\tilde{D}$ = (1,0,1,0,0,0,1,0,0,0,1)
$\hat{X}$ = (0,1,0,0,1,0,1).

Since $\hat{X} = X$, the reconstruction is correct.
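The stages of Section III-A, applied to the example above, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reference implementation: the entropy coding of stage 3 is omitted, X is assumed to be a subsequence of Y, and all function names (`runs_of`, `greedy_align`, `describe_deletions`, `decode`) are hypothetical.

```python
from itertools import groupby

def runs_of(y):
    """Return (start, extent) for each maximal run of y, left to right."""
    runs, pos = [], 0
    for _, group in groupby(y):
        extent = len(list(group))
        runs.append((pos, extent))
        pos += extent
    return runs

def greedy_align(x, y):
    """Encoder stage 1: match each bit of x to the leftmost unused equal bit of y.
    Returns the reconstructed deletion pattern (1 = bit of y matched to a gap).
    Assumes x is a subsequence of y."""
    d_hat, j = [1] * len(y), 0
    for bit in x:
        while y[j] != bit:
            j += 1
        d_hat[j] = 0
        j += 1
    return d_hat

def describe_deletions(y, d_hat):
    """Encoder stage 2: v[l][i] is the deletion count of the i-th run of extent l."""
    v = {}
    for start, extent in runs_of(y):
        v.setdefault(extent, []).append(sum(d_hat[start:start + extent]))
    return v

def decode(y, v):
    """Decoder stage 2: remove v[l][i] bits from the i-th run of extent l."""
    seen, x_hat = {}, []
    for start, extent in runs_of(y):
        i = seen.get(extent, 0)
        seen[extent] = i + 1
        keep = extent - v[extent][i]
        # Which bits of the run are removed does not affect the output.
        x_hat.extend(y[start:start + keep])
    return x_hat
```

With Y and X from the example, `describe_deletions(Y, greedy_align(X, Y))` groups the counts by extent as in stage 2, and `decode(Y, v)` returns X. Keeping one list per extent is what allows a separate entropy coder (or a separate adaptive model) per extent in stage 3.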

C. Performance of the algorithm 

Let $U := \{U_l\}_{l=1}^{L_{\max}}$ and $V := \{V_{l,i}\}_{l=1,\ldots,L_{\max};\ i=1,\ldots,U_l}$. In the limit as
the lengths of the sequences tend to infinity, the operational
rate of this algorithm is $R_{op} = \lim_{n\to\infty} H(V)/n$. The optimal
rate is $\lim_{n\to\infty} H(X|Y)/n$. When the probability of deletion $d$
is small, the following theorem shows that the algorithm is
asymptotically optimal.

Theorem 1: The gap between the operational rate of the
algorithm described in Section III-A and the optimal rate
satisfies: $\lim_{n\to\infty} [H(V)/n - H(X|Y)/n] = O(d^{2-\epsilon})$, for any
$\epsilon > 0$.

The proof is provided in Appendix A and can be intuitively
explained as follows. When $d$ is small, the deletions
are typically far away from each other. Therefore the intervals
between the deletions are long enough to be used to
synchronize segments of X to segments of Y. As a result,
the deletions can be located within the correct runs with high
probability. The exact positions of the deletions within the
runs are impossible to find based on only X and Y. Since the
goal is to reconstruct X, describing the positions within runs
is unnecessary. Moreover, the description of the locations of
the deletions, V, is almost independent of the decoder
side-information Y. Therefore sending V is approximately optimal
in terms of rate. See Section III-D2 for more discussion
about the independence between V and Y. The deletions
cannot be located within the correct runs only if two or more
deletions are in the same run or adjacent runs, which occurs
with probability of order $O(d^2)$. Therefore the gap
between the operational rate and the optimum is of order
$O(d^{2-\epsilon})$.

Remark 1: In [1], we have shown that when $p = 1/2$, for
any $\epsilon > 0$, $\lim_{n\to\infty} H(X|Y)/n = h_2(d) - cd + O(d^{2-\epsilon})$, where
$c := \sum_{l=1}^{\infty} 2^{-l-1}\, l \log_2 l \approx 1.29$.^3 It captures the asymptotic expansion
of the optimal rate to the precision of $\Theta(d)$ with a remainder
term $O(d^{2-\epsilon})$. Due to Theorem 1, $R_{op} = h_2(d) - cd + O(d^{2-\epsilon})$,
which also matches the optimal rate to the precision of $\Theta(d)$.

Remark 2: In [1], we have shown that $\lim_{n\to\infty} H(X|Y)/n$ is
also the minimum rate when the side-information is available
only at the decoder but not the encoder. Although the
minimum rate is the same, constructing an explicit algorithm
to implement the distributed compression at the asymptotically
optimal rate remains an open problem.

D. Comparison to other compression algorithms 

Let us compare the algorithm described in Section III-A
with two simpler but suboptimal algorithms in the simple case
Y ~ iid Bernoulli(1/2) ($p = 1/2$). The comparison reveals
more intuition on why the algorithm is asymptotically optimal.

1) Sending $\hat{D}$ directly: A simple and the most natural
algorithm to compress X given Y is first running a greedy
alignment to obtain $\hat{D}$ (as in stage 1)) and then compressing
$\hat{D}$ using an entropy coder (similar to stage 3)). As the lengths
of the sequences tend to infinity, the operational rate is
$\lim_{n\to\infty} H(\hat{D})/n$. If we approximate $H(\hat{D})$ by $H(D)$,^4 the operational
rate is approximately $h_2(d) = -d \log_2 d + d \log_2 e + O(d^2)$.
Therefore for small $d$, the operational rate of this simple
algorithm matches the optimal expression up to the $-d \log_2 d$
term. But for the $\Theta(d)$ term, there is a gap $cd \approx 1.29d$. That
is, this compression algorithm wastes 1.29 bits per deleted bit
on average. When $d$ is not very small, $-d \log_2 d$ and $d$ can be
of the same order of magnitude. Therefore the gap may not
be negligible in practice.

The above strategy is suboptimal because D specifies the 
exact positions of the deletions. Note that after specifying the 
runs that contain the deletions and specifying the number of 
deletions in each run, X can already be deduced from Y. How- 
ever, this strategy goes further and specifies the exact positions 
within the runs, which are redundant in terms of reconstructing 
X. Therefore this strategy over-describes the positions of the 
deletions beyond what is necessary to represent X. The amount 
of over-description, $H(D|X,Y)$, is called "nature's secret" in
[1], because only the hypothetical party "nature" has access
to $D$, but the encoder and decoder do not.

2) Locating deletions up to runs: The analysis of the
previous strategy suggests that the encoder should specify the
location of the deletions with respect to runs. Therefore a
better algorithm than the one described in Section III-D1 is
first defining a sequence W such that $W_i$ is the number of
deletions in the $i$-th run of Y according to $\hat{D}$, then compressing
W at the entropy rate.

^3 In [1], Y is defined as the deleted version of X. Therefore the expression
$H(X|Y)$ in this paper corresponds to $H(Y|X)$ in [1].

^4 It can be made rigorous using techniques similar to those used
to prove argument (iii) in Appendix A.

TABLE I
Performance of compression algorithms for n = 1000kb, d = 0.01.

p     No SI    Sec. III-D1   Sec. III-D2   Sec. III-A
0.5   990kb    81kb          71kb          68kb
0.1   469kb    81kb          63kb          46kb

Since the average extent of a run in an iid Bernoulli(1/2)
sequence is 2, the length of W is approximately half of that of
$\hat{D}$. It can be shown^5 that the operational rate can be approximated
by $h_2(d) - d$. There is still a linear-in-$d$ gap between this
rate and the optimal one, given by $(c - 1)d \approx 0.29d$. That is,
this algorithm wastes 0.29 bit per deleted bit.

Why is this algorithm suboptimal? The reason is that
W is significantly correlated with Y. If the deletion process is
iid, then the longer runs of Y tend to contain more deletions
and the shorter runs tend to contain fewer deletions. Therefore Y
reveals a certain amount of information about W, namely about
0.29 bit per deleted bit. The algorithm described above does
not use this information and thus is suboptimal.
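The dependence of the deletion counts on the run extent is easy to quantify under the iid deletion model: the number of deletions in a run of extent $l$ is Binomial($l$, $d$). The sketch below is an illustration only; the helper name `deletion_count_pmf` and the chosen extents are assumptions made for this example.

```python
from math import comb

def deletion_count_pmf(extent, d):
    """P(v deletions in a run of the given extent) for iid Bernoulli(d) deletions."""
    return [comb(extent, v) * d**v * (1 - d)**(extent - v)
            for v in range(extent + 1)]

# Probability that a run contains at least one deletion, for several extents.
for extent in (1, 2, 3, 5, 10):
    pmf = deletion_count_pmf(extent, d=0.01)
    print(extent, round(1 - pmf[0], 4))
```

The printed probabilities grow with the extent, which is the information about W that Y provides for free and that a single entropy coder over all runs fails to exploit.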

The algorithm described in Section III-A, however, treats
the deletions contained in runs of different extents differently.
As a result the operational rate matches the optimal rate for
the $\Theta(d)$ term.

Table I provides a comparison of the performance of the
two algorithms in Section III-D and the one in Section III-A
for n = 1000kb and d = 0.01. Note that when Y has biased
bits (p = 0.1), the benefit of the proposed algorithm in
Section III-A is more significant than when p = 0.5. The
reason is that when p = 0.1, the runs of Y are longer and it
pays to exploit the information from the run-lengths.
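As a rough numerical check for the case p = 0.5, the leading-order rate expressions quoted in this section, $h_2(d)$, $h_2(d) - d$ and $h_2(d) - cd$, can be evaluated for n = 1000kb and d = 0.01. This is a sketch under the assumption that the remainder terms are ignored; the variable names and the truncation of the series for $c$ at 200 terms are choices made for this illustration.

```python
from math import log2

def h2(d):
    """Binary entropy function in bits."""
    return -d * log2(d) - (1 - d) * log2(1 - d)

def constant_c(terms=200):
    """Truncated series c = sum_{l>=1} 2^(-l-1) * l * log2(l); the l = 1 term is 0."""
    return sum(2.0 ** (-l - 1) * l * log2(l) for l in range(2, terms))

n_kb, d = 1000, 0.01
c = constant_c()
rate_send_pattern = n_kb * h2(d)          # Section III-D1: send the deletion pattern
rate_per_run = n_kb * (h2(d) - d)         # Section III-D2: counts per run, one coder
rate_per_extent = n_kb * (h2(d) - c * d)  # Section III-A: counts per run, by extent
print(round(c, 3), rate_send_pattern, rate_per_run, rate_per_extent)
```

The three values land close to the p = 0.5 row of Table I. No comparable closed form for p = 0.1 is given in this section, so that row is not reproduced here.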

IV. Algorithm for the General Case 

The algorithm described in Section III-A can be extended
to the general problem where Y is related to X by insertions,
deletions and substitutions.

A. Algorithm for insertions, deletions and substitutions
The encoder has the following stages.

1) Alignment: align X and Y using the minimum total
number of insertions, deletions and substitutions. If there
are multiple such alignments, pick any one of them. This
can be done by the Needleman-Wunsch algorithm [12]
with the gap penalty and the substitution penalty equal
to 1, with computational complexity of order $O(n^2)$. The
algorithm generates two sequences X* and Y*, which
are X and Y with gaps, respectively. Then construct
$Z_X$ and $Z_Y$ by replacing the gaps in X* and Y* by the
corresponding bits in Y* and X*, respectively.

2) Describing the insertions (from Y to $Z_Y$):

The edits from Y to $Z_Y$ can be viewed as insertions.
The locations of the insertions are specified by the gaps
in Y*. The content of the insertions is specified by the
corresponding bits in $Z_Y$.

All the insertions can be categorized into isolated insertions
with only one bit per insertion event, and
bursts of insertions with two or more consecutive bits
per insertion event. For each isolated insertion, if the
inserted bit is equal to the bit on the left (or right) side,
the insertion is extending the run to the left (or right). If
the inserted bit is not equal to the bits on either side, it is
breaking an existing run and creating a new run. We will
describe the isolated insertions that extend runs, then the
insertions that break runs, then the bursts of insertions.
• In order to describe the insertions that extend runs,
the encoder does the following.

- For $l = 1, \ldots, L_{\max}$ ($L_{\max}$ is the maximum
extent of the runs in Y), do:
* For $i = 1, \ldots, U_l$ ($U_l$ is the number of runs
of extent $l$ in Y), let $V^{\mathrm{ins}}_{l,i} := 1$ if the $i$-th run
of extent $l$ in Y is extended by one bit, and
$V^{\mathrm{ins}}_{l,i} := 0$ otherwise.
Having made such insertions, Y becomes Y'.
• In order to describe the insertions that break runs,
the encoder does the following.
In the sequence Y', a slot between two bits is a
potential location to break a run only if the two
bits are the same. The slots before the first bit and
after the last bit are also potential locations to create
new runs. Let $U_0$ denote the total number of such
potential locations in Y'. For $i = 1, \ldots, U_0$, let
$V^{\mathrm{ins}}_{0,i} := 1$ if a bit is inserted in the $i$-th potential
location, and $V^{\mathrm{ins}}_{0,i} := 0$ otherwise.
Having made such insertions, Y' becomes Y''. Let
$V^{\mathrm{ins}}$ denote all the descriptions up to this step, that is,
the collection of the $V^{\mathrm{ins}}_{l,i}$ for $l = 0, 1, \ldots, L_{\max}$
and $i = 1, \ldots, U_l$.
• In order to describe the bursts of insertions, the
encoder creates a sequence $V^{\mathrm{burst}}$ from $Z_Y$ by keeping
the bursts of inserted bits and replacing the other
bits by '*'. $V^{\mathrm{burst}}$ describes the insertions needed to
construct $Z_Y$ from Y''.

3) Describing the substitutions (from $Z_Y$ to $Z_X$):

The edits from $Z_Y$ to $Z_X$ can be viewed as substitutions,
which can be described by $V^{\mathrm{sub}} := Z_Y \oplus Z_X$.

4) Describing the deletions (from $Z_X$ to X) as in stage 2) of
Section III-A. Denote the description by $V^{\mathrm{del}}$.

5) Entropy coding: Use an entropy coder to compress $V^{\mathrm{ins}}$,
$V^{\mathrm{burst}}$, $V^{\mathrm{sub}}$ and $V^{\mathrm{del}}$.

The decoder decodes $V^{\mathrm{ins}}$, $V^{\mathrm{burst}}$, $V^{\mathrm{sub}}$ and $V^{\mathrm{del}}$ with an
entropy decoder, and then follows stages 2) to 4) to
construct X from Y.

^5 Using techniques similar to those used to prove argument (iii)
in Appendix A.
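The alignment of stage 1) can be sketched with the standard dynamic program for unit gap and substitution penalties. The code below is a minimal illustration of that textbook recursion, not the authors' implementation; the function name `align_unit_cost` and the tie-breaking order in the traceback are assumptions made for this example.

```python
def align_unit_cost(x, y):
    """Global alignment with unit gap and substitution penalties.
    Returns (cost, x_star, y_star), where '-' marks a gap."""
    m, n = len(x), len(y)
    # cost[i][j] = minimum number of edits aligning x[:i] with y[:j].
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # match or substitution
                cost[i - 1][j] + 1,                           # gap in y
                cost[i][j - 1] + 1,                           # gap in x
            )
    # Trace back one optimal alignment (ties are broken arbitrarily).
    x_star, y_star, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            x_star.append(x[i - 1]); y_star.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            x_star.append(x[i - 1]); y_star.append('-'); i -= 1
        else:
            x_star.append('-'); y_star.append(y[j - 1]); j -= 1
    return cost[m][n], x_star[::-1], y_star[::-1]
```

For the sequences X = (0,0,1,1,0,1) and Y = (0,1,0,0,1,1) of Section I, the minimum cost is 3. The returned alignment may differ from the one shown there, since several alignments attain the minimum and the algorithm is allowed to pick any of them.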

B. Performance analysis 

The operational rate of the above algorithm can be analyzed 
for small probability of insertion, deletion and substitution as 
follows. 

Theorem 2: The gap between the operational rate of the
algorithm described in Section IV-A and the optimal rate
satisfies: $\lim_{n\to\infty} [H(V^{\mathrm{ins}}, V^{\mathrm{burst}}, V^{\mathrm{sub}}, V^{\mathrm{del}})/n - H(X|Y)/n] =
O(d^{2-\epsilon})$, for any $\epsilon > 0$, where $d = \max\{d_X, d_Y, q\}$.

The proof is similar to that of Theorem 1 and is provided
in Appendix B.

Intuitively, when the editing probabilities $d_X$, $d_Y$ and $q$
are small, the edits are typically far away from each other.
Therefore the intervals between the edits are so long that the
segments of X in the intervals can be correctly matched to
the corresponding segments of Y. As a result, the edits can be
isolated. The operational message rate is approximately equal
to the sum of the message rates in the pure deletion
problem, the pure insertion problem and the pure substitution
problem. On the other hand, the conditional entropy rate
$\lim_{n\to\infty} H(X|Y)/n$ can also be approximated by the sum of the conditional
entropy rates of the pure deletion problem ($\lim_{n\to\infty} H(X|Z_X)/n$),
the pure substitution problem ($\lim_{n\to\infty} H(Z_X|Z_Y)/n$), and the pure
insertion problem ($\lim_{n\to\infty} H(Z_Y|Y)/n$), with an approximation
gap no more than $O(d^{2-\epsilon})$. Therefore the algorithm described
above is asymptotically optimal.

V. Concluding Remarks 

We have studied the problem of compressing a source se- 
quence in the presence of side-information that is mis-aligned 
to the source due to insertions, deletions and substitutions. We 
have proposed an algorithm to compress the source sequence 
given the side-information at both the encoder and decoder.
For small insertion and deletion probability, the compression
rate of the algorithm is asymptotically optimal. Directions 
for future work include (1) developing algorithms for bursty 
insertions, deletions, and substitutions, and (2) developing 
distributed algorithms to compress a source sequence when 
the reference sequence is only available at the decoder side. 

Appendix A
Proof of Theorem 1

Stage 2) of the algorithm described in Section III-A processes
the reconstructed deletion pattern $\hat{D}$ to generate
V. Let $\tilde{V}$ be the output if the true deletion pattern $D$ were
used as the input. Note that the sizes of V and $\tilde{V}$ are identical,
because they are both determined by Y.

We have

$$
\begin{aligned}
H(X|Y) &= H(X, \tilde{V}|Y) - H(\tilde{V}|X, Y) \\
&\overset{(a)}{=} H(\tilde{V}|Y) - H(\tilde{V}|X, Y) \\
&= H(\tilde{V}) - I(\tilde{V}; Y) - H(\tilde{V}|X, Y) \\
&= H(V) - I(\tilde{V}; Y) - H(\tilde{V}|X, Y) - (H(V) - H(\tilde{V})),
\end{aligned}
$$

where step (a) is because X is determined by Y and $\tilde{V}$. We will
prove the following three arguments: (i) $\lim_{n\to\infty} I(\tilde{V}; Y)/n = 0$,
(ii) $\lim_{n\to\infty} H(\tilde{V}|X, Y)/n = O(d^{2-\epsilon})$ for any $\epsilon > 0$, and (iii)
$\lim_{n\to\infty} |H(V) - H(\tilde{V})|/n = O(d^{2-\epsilon})$ for any $\epsilon > 0$.

Proof of argument (i): Given $U_l = u_l$, $\{\tilde{V}_{l,i}\}_{i=1}^{u_l}$ is an iid
sequence with distribution $p_{\tilde{V}_{l,i}}(v) = \binom{l}{v} d^v (1 - d)^{l-v}$. Therefore
$Y - U - \tilde{V}$ forms a Markov chain. By the data processing
inequality, we have $\lim_{n\to\infty} I(Y; \tilde{V})/n \le \lim_{n\to\infty} H(U)/n = 0$.
Therefore (i) is proved.

Proof of argument (ii): Let an extended run be a run along
with one additional bit at each end of the run [13]. We call an
extended run of Y atypical if it contains more than one deletion
according to $D$. Let $D^*$ be the sequence that is identical to $D$
in the atypical extended runs, and is equal to '*' otherwise.
Suppose there are $K$ runs of '*'s in $D^*$, and the $i$-th run starts
from position $a_i$ and ends at position $b_i$. Let $C_i := \sum_{j=a_i}^{b_i} D_j$.
Let $C := (C_1, \ldots, C_K)$.

With the help of $D^*$ and $C$, aligning X and Y becomes
easier. The atypical extended runs divide the whole sequences
into $K$ segments. One can locate $K$ segments in X, each
of which corresponds to a run of '*'s in $D^*$. Within each
segment, there are no longer atypical extended runs, and
the deletions can be located in the correct runs without any
ambiguity [11, Proof of Lemma IV.4]. Since $\tilde{V}$ is only about
the locations of deletions up to runs, $H(\tilde{V}|X, Y, D^*, C) = 0$,
which implies that $H(\tilde{V}|X, Y) \le H(D^*, C)$. Since an extended
run is atypical with probability $O(d^2)$, $E[K]/n = O(d^2)$ and
$H(D^*, C)/n = O(d^{2-\epsilon})$ for any $\epsilon > 0$. Therefore argument (ii) holds.

Proof of argument (iii): We need to compare the compressed
representation of the true deletion pattern $D$ and that of the
reconstructed deletion pattern $\hat{D}$ generated by the greedy alignment
algorithm. We introduce a sequence $\Delta = (\Delta_0, \Delta_1, \ldots, \Delta_n)$
to indicate the difference between $D$ and $\hat{D}$.

Let $\Delta_0 := 0$. For $i = 1, 2, \ldots, n$, let $\Delta_i := \Delta_{i-1} + D_i - \hat{D}_i$. The
condition $\Delta_i = 0$ means that the greedy alignment algorithm
is aligning $Y_i$ to the correct bit in X. Given $\Delta_{i-1}$ and $D_i$, the
value of $\Delta_i$ is as follows.

1) If $\Delta_{i-1} = 0$ and $D_i = 0$, then $\hat{D}_i$ must be 0 and hence
$\Delta_i = 0$.

2) If $\Delta_{i-1} = 0$ and $D_i = 1$, then $\hat{D}_i = Y_i \oplus Y_j$, where $Y_j$ is
the next undeleted bit. Since Y ~ iid Bernoulli($p$), $\hat{D}_i$ ~
Bernoulli($2p(1-p)$). Therefore $\Delta_i$ ~ Bernoulli($1 - 2p + 2p^2$).

3) If $\Delta_{i-1} \neq 0$, whether $D_i = 0$ or $D_i = 1$, we have $\hat{D}_i =
Y_i \oplus Y_j$, where $Y_j$ is the next undeleted bit. Therefore
$\hat{D}_i$ ~ Bernoulli($2p(1-p)$), and $\Delta_i = \Delta_{i-1} + D_i - \hat{D}_i$,
where $D_i$ and $\hat{D}_i$ are independent, $D_i$ ~ Bernoulli($d$) and
$\hat{D}_i$ ~ Bernoulli($2p(1-p)$).

Therefore $\Delta$ is a first-order Markov chain with the following
transition probabilities: $P(\Delta_i = 1|\Delta_{i-1} = 0) = 1 - P(\Delta_i =
0|\Delta_{i-1} = 0) = d(1 - 2p + 2p^2)$. For $k \neq 0$, $P(\Delta_i = k + 1|\Delta_{i-1} =
k) = d(1 - 2p + 2p^2)$, $P(\Delta_i = k|\Delta_{i-1} = k) = 2p(1 - p)$, and
$P(\Delta_i = k - 1|\Delta_{i-1} = k) = (1 - d)(1 - 2p + 2p^2)$. An important
property of $\Delta$ is that, when $d \ll 1$, starting from an arbitrary
state, the Markov chain returns to the state 0 in $O(1)$ steps
on average. Therefore if the output of the greedy alignment
algorithm disagrees with the true deletion pattern at some
symbol, they will come back to an agreement in $O(1)$ steps.

When we read $D$ and Y from left to right, if there is
a deletion ($D_i = 1$ for some $i$) and the run in Y that
follows the run containing the deletion is not completely
deleted, then the greedy alignment algorithm can locate the
deletion in the correct run. For example, if Y = (0,0,1,0)
and $D$ = (1,0,0,0), the first '0' is deleted. The algorithm
will generate $\hat{D}$ = (0,1,0,0), locating the deletion at the
second bit, which is in the same run as the first bit. Since
the compressed representations V and $\tilde{V}$ are only about the
locations of deletions up to runs, the corresponding entries in
V and $\tilde{V}$ related to this deletion are identical.

If there is a deletion, and the run in Y that follows the
run containing the deletion is completely deleted, then the
greedy alignment will locate some deletions in wrong runs. For
example, if Y = (0,0,1,0) and $D$ = (1,0,1,0), the first '0' and
the '1' are deleted. The algorithm will generate $\hat{D}$ = (0,0,1,1),
locating the deletion of '0' incorrectly in the third run instead
of the first run. Since such an event requires at least two
deletions in consecutive runs, it occurs with probability $O(d^2)$.
Since $D$ and $\hat{D}$ will return to an agreement in $O(1)$ steps,
with high probability, $n \cdot O(d^2)$ deletions may be placed in
wrong runs by the greedy alignment algorithm throughout the
sequence. Therefore up to $n \cdot O(d^2)$ entries of V and $\tilde{V}$ can be
different. Hence the entropy of the component-wise difference
is $H(V - \tilde{V}) = n \cdot O(d^{2-\epsilon})$.

Therefore $|H(V) - H(\tilde{V})|/n = |H(V|\tilde{V}) - H(\tilde{V}|V)|/n \le 2H(V -
\tilde{V})/n = O(d^{2-\epsilon})$, which completes the proof of argument (iii) and
Theorem 1.

Appendix B 
Proof of Theorem |2] 

In this appendix, let V denote (V'"', V^""', V™\ V''"')- Let 

V = (V"", V'"'"', V™^ V'''^') denote the corresponding descrip- 
tion of the isolated insertions, bursty insertions, substitutions 
and deletions if the underlying sources Zx, Zy, Dx and Dy 
are used. Note that the entries where both Dx and Dy specify 
deletions are not considered as edits at all. The probability 
that such an entry occurs is 0{d^). 

As in the proof of Theorem [T] we have 

H{X\Y) = H{X, V|Y) - H(\\X, Y) 

= H{\\Y) - H(\\X,Y) 

= Hi\) - Y) - H{\\X, Y), 

= H{\) - /(V; Y) - H{\\X, Y) - (//(V) - H{\)), 

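We will prove the following three arguments:
(i) $\lim_{n \to \infty} I(\bar{V}; Y)/n = O(d^{2-\epsilon})$ for any $\epsilon > 0$,
(ii) $\lim_{n \to \infty} H(\bar{V}|X, Y)/n = O(d^{2-\epsilon})$ for any $\epsilon > 0$, and
(iii) $\lim_{n \to \infty} |H(V) - H(\bar{V})|/n = O(d^{2-\epsilon})$ for any $\epsilon > 0$.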

Proof of argument (i): $I(\bar{V}; Y) = I(\bar{V}^{ins}; Y)
+ I(\bar{V}^{bins}; Y | \bar{V}^{ins})
+ I(\bar{V}^{sub}; Y | \bar{V}^{ins}, \bar{V}^{bins})
+ I(\bar{V}^{del}; Y | \bar{V}^{ins}, \bar{V}^{bins}, \bar{V}^{sub})$.
Due to the same reason as in the proof of argument (i) in Appendix A,
$I(\bar{V}^{ins}; Y) = o(n)$. Since a bursty insertion appears with
probability $O(d^2)$, $I(\bar{V}^{bins}; Y | \bar{V}^{ins}) \le
H(\bar{V}^{bins}) = n \cdot O(d^{2-\epsilon})$.
Since the substitutions represented by $\bar{V}^{sub}$ are independent
of $(Y, D_Y, D_X)$, $I(\bar{V}^{sub}; Y | \bar{V}^{ins}, \bar{V}^{bins}) = 0$.
Due to the same reason as in the proof of argument (i) in Appendix A,
$I(\bar{V}^{del}; Y | \bar{V}^{ins}, \bar{V}^{bins}, \bar{V}^{sub}) = o(n)$.
Combining these four terms we have proved argument (i).

Proof of argument (ii): The sequences $Z_X$, $Z_Y$, $D_X$ and $D_Y$
imply edits including insertions, deletions and substitutions.
Let us define the neighborhood of an edit as follows. The
neighborhood of a substitution at position $i$ is the substitution
together with the first run starting at position $i + 1$ and the
first bit of the second run. For example, when $Z_X = (0, 1, 1, 0, 0)$
and $Z_Y = (1, 1, 1, 0, 0)$, there is a substitution at position $i = 1$,
the neighborhood of which consists of the first four bits. The
neighborhood of a deletion in $D_X$ at position $i$ is the run in $Z_Y$
that contains the deletion, which ends at position $j$, together
with positions $j + 1, \ldots, j + k$, where $k$ is the smallest integer
satisfying $k \ge 2$ and $Z_{Y,j+k} \ne Z_{Y,j+k-2}$. For example, when
$Z_X = Z_Y = (1, 1, 0, 1, 0, 0)$ and $D_X = (1, 0, 0, 0, 0, 0)$, there is a
deletion at position $i = 1$. The run containing the deletion ends at
position $j = 2$, and $k = 4$. Therefore the neighborhood of this
deletion consists of all six bits. The neighborhood of a deletion
in $D_Y$ is similarly defined. The concept of neighborhood
plays the same role as the "extended run" in the proof of
argument (ii) in Appendix A: knowing that there is no other edit
within the neighborhood of the first edit, the first edit can be
located, without any ambiguity, in the correct run if it is a
deletion or insertion, and can be located precisely if it is a
substitution.
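
The two neighborhood definitions can be illustrated with a minimal sketch. It makes several assumptions that are not in the text: positions are 1-indexed, runs are taken in the single sequence passed in, the condition on k is read as "bit j+k differs from bit j+k-2" (a reading that reproduces k = 4 in the example above), and a neighborhood that would run past the end of the sequence is simply clipped. All function names are hypothetical.

```python
def run_end(z, i):
    """Last 1-indexed position of the run of z that contains position i."""
    j = i
    while j < len(z) and z[j] == z[i - 1]:  # z[j] is the bit at position j + 1
        j += 1
    return j


def substitution_neighborhood(z, i):
    """(first, last) positions: the substituted bit at i, the run starting
    at i + 1, and the first bit of the run after it (clipped at the end)."""
    n = len(z)
    if i >= n:
        return (i, i)
    return (i, min(run_end(z, i + 1) + 1, n))


def deletion_neighborhood(z, i):
    """(first, last) positions: the run containing i, which ends at j, plus
    positions j+1..j+k for the smallest k >= 2 whose bit at position j+k
    differs from the bit at position j+k-2 (clipped at the end)."""
    n = len(z)
    start = i
    while start > 1 and z[start - 2] == z[i - 1]:
        start -= 1
    j = run_end(z, i)
    k = 2
    while j + k <= n and z[j + k - 1] == z[j + k - 3]:
        k += 1
    return (start, min(j + k, n))


# Examples from the text.
sub_nbhd = substitution_neighborhood((1, 1, 1, 0, 0), 1)
del_nbhd = deletion_neighborhood((1, 1, 0, 1, 0, 0), 1)
```

Here `sub_nbhd` covers the first four positions and `del_nbhd` covers all six, matching the two examples in the definition.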

When an edit appears and another edit appears within the
neighborhood of the first edit, we call this neighborhood
atypical. Let $Z_X^*$, $Z_Y^*$, $D_X^*$ and $D_Y^*$ be the sequences
that are identical to $Z_X$, $Z_Y$, $D_X$ and $D_Y$ in the atypical
neighborhoods, and take the value '*' otherwise. Thus the sequences
are divided by the atypical neighborhoods into $K$ segments of
'*'s. Let $C_i^{ins}$, $C_i^{sub}$ and $C_i^{del}$ be the numbers of
insertions, substitutions and deletions in the $i$-th run of '*'s,
respectively. Let

$$C = (C_1^{ins}, C_1^{sub}, C_1^{del}, \ldots, C_K^{ins}, C_K^{sub}, C_K^{del}).$$

With the help of $Z_X^*$, $Z_Y^*$, $D_X^*$, $D_Y^*$ and $C$, aligning $X$
and $Y$ becomes easier. The atypical neighborhoods divide
the whole sequences into $K$ segments. One can locate $K$
segments in $X$ and $Y$, each of which corresponds to a
run of '*'s in $Z_X^*$, $Z_Y^*$, $D_X^*$ and $D_Y^*$. Within each segment,
there are no longer atypical neighborhoods, and the edits
can be located in the correct runs for insertions and
deletions and can be located precisely for substitutions.
Therefore $H(\bar{V}|X, Y, Z_X^*, Z_Y^*, D_X^*, D_Y^*, C) = 0$, which implies
that $H(\bar{V}|X, Y) \le H(Z_X^*, Z_Y^*, D_X^*, D_Y^*, C)$. Since an atypical
neighborhood appears with probability $O(d^2)$, $E[K]/n = O(d^2)$
and $H(Z_X^*, Z_Y^*, D_X^*, D_Y^*, C)/n = O(d^{2-\epsilon})$ for any
$\epsilon > 0$. Therefore argument (ii) holds.

Proof of argument (iii): Stage 1) of the algorithm specified
in Section IV-A generates a reconstructed alignment with the
minimum number of edits, which can be compared with the
original alignment specified by $Z_X$, $Z_Y$, $D_X$ and $D_Y$.

For each edit in the original editing process, if there is no 
other edit in its neighborhood, the reconstructed alignment 
must locate the correct type of edit within the correct run
if the edit is an insertion or a deletion, or at the correct
position if the edit is a substitution. Otherwise the erroneous 
alignment leads to at least another edit in the neighborhood, 
which violates the assumption that the reconstructed alignment 
has the minimum number of edits. 

If there is at least one other edit in the neighborhood of
the previous edit, so that the neighborhood is atypical, the
reconstructed alignment is not guaranteed to be the same
as the original alignment. Since such an event occurs with
probability $O(d^2)$, the number of atypical neighborhoods is
on the order of $n \cdot O(d^2)$. Therefore $V$ and $\bar{V}$
differ in no more than $n \cdot O(d^2)$ entries. Therefore argument
(iii) holds.
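
The last step, from a count of differing entries to an entropy bound, can be sketched as follows. This is only an illustration under simplifying assumptions that are not stated in the text: E denotes the component-wise difference between the two descriptions compared in argument (iii), E has at most n entries, at most m = c n d^2 of them are nonzero (with c d^2 <= 1/2), and each nonzero entry takes one of at most A values. The constants c and A are hypothetical, and h denotes the binary entropy function.

```latex
\begin{align*}
H(E) &\le \log_2 \sum_{k=0}^{m} \binom{n}{k} + m \log_2 A
      \le n\, h\!\left(\tfrac{m}{n}\right) + m \log_2 A, \\
\frac{H(E)}{n} &\le h(c d^2) + c d^2 \log_2 A
      \le c d^2 \log_2 \frac{e}{c d^2} + c d^2 \log_2 A
      = O\!\left(d^2 \log \tfrac{1}{d}\right)
      = O(d^{2-\epsilon}) \quad \text{for any } \epsilon > 0.
\end{align*}
```

The first line counts the possible supports and values of E, and the second uses the bound h(p) <= p log_2(e/p). Since the entropies of the two descriptions differ by at most a constant multiple of H(E), a bound of this form is consistent with the O(d^{2-\epsilon}) rate claimed in argument (iii).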